Analysis of Red Wine Quality

In this project, we analyze red wine quality. This report explores a dataset containing quality ratings and attributes for approximately 1,600 red wines, described by 11 input variables. All 11 input variables are continuous; quality is a score from 3 to 8.

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol      quality
##  Min.   : 8.40   3: 10  
##  1st Qu.: 9.50   4: 53  
##  Median :10.20   5:681  
##  Mean   :10.42   6:638  
##  3rd Qu.:11.10   7:199  
##  Max.   :14.90   8: 18
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

The histogram of quality looks roughly normal, with a peak at 5 to 6. From the summary, the mean is 5.636 and the median is 6. The best quality is 8 and the worst is 3.
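Because quality is stored as a factor, its mean and median cannot be read off directly. The sketch below rebuilds the quality column from the counts printed above (no other data is assumed) and converts the factor labels back to numbers, using the same `as.numeric(levels(x))[x]` idiom that appears in the model formulas later in this report.

```r
# Rebuild the quality factor from the counts in the table above.
counts  <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- factor(rep(names(counts), times = counts))

# Pitfall: as.numeric() on a factor returns the internal codes 1..6,
# not the labels 3..8, so convert through the levels instead.
quality_num <- as.numeric(levels(quality))[quality]

mean_quality   <- mean(quality_num)    # about 5.636
median_quality <- median(quality_num)  # 6
print(c(mean = mean_quality, median = median_quality))
```

The result agrees with the mean and median quoted in the text.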

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Fixed acidity looks roughly normal after switching to a log10 scale. The standard deviation is 1.741 and the median is 7.9. Most wines have a fixed acidity between 6 and 10 (no unit given).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Volatile acidity looks somewhat normal on the original linear scale. On a log10 scale it still does not look perfectly normal. After trying several transformations, I found that a power of 0.2 gives the best scale for this variable.
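One way to compare candidate scales without judging histograms by eye is to compute the sample skewness under each transformation. The sketch below is only an illustration: the data are simulated from a log-normal distribution (not the wine data), and `skewness` is a small helper defined here because base R does not provide one.

```r
# Sample skewness: third central moment divided by variance^1.5.
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

# Simulated right-skewed data standing in for a long-tailed variable.
set.seed(123)
x <- rlnorm(1599, meanlog = log(0.08), sdlog = 0.5)

# Compare the raw scale with a log10 scale and a power 0.2 scale.
candidates <- list(raw = x, log10 = log10(x), power_0.2 = x^0.2)
skews <- sapply(candidates, skewness)
print(round(skews, 3))
```

A value closer to zero indicates a more symmetric distribution, so the transformation with the smallest absolute skewness is a reasonable first choice.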

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

I transformed the long-tailed chlorides data to better understand its distribution. The transformed chlorides distribution looks normal, with a peak around 0.079.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

The density range is quite small, which makes sense. This variable also follows a roughly normal distribution on the original linear scale.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The pH variable looks closest to normal on a power 0.1 scale. The mean is 3.311, the minimum is 2.74 and the maximum is 4.01.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Sulphates also has a long right tail, and it looks much better on a log10 scale. The median is 0.62 and the mean is 0.6581.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

The total.sulfur.dioxide variable is clearly best viewed on a log10 scale. The median is 38.00, the mean is 46.47, and the range is 6 to 289.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Residual sugar is also best viewed on a log10 scale. The median is 2.200 and the mean is 2.539.

Across the quality levels, the most common alcohol values fall between 9 and 11.

The histograms of chlorides, sulphates, total.sulfur.dioxide and residual.sugar are right-skewed, so I transform these variables with a log transform. The histograms of quality, fixed.acidity, volatile.acidity, density and pH are already roughly bell-shaped on the original linear scale, so these variables do not need a log transform.

Univariate Analysis

What is the structure of your dataset?

Our dataset consists of 13 variables and 1599 observations. Five variables are best viewed on the original linear scale: quality, fixed.acidity, volatile.acidity, density and pH. Four variables are best viewed on a log10 scale: chlorides, sulphates, total.sulfur.dioxide and residual.sugar. The other variables do not clearly fit either scale or any particular distribution. Quality ranges from 3 (worst) to 8 (best).

Other observations:
* Most wines have quality 5 or 6.
* The median quality is 6.
* The maximum quality is 8.
* The best scale for volatile.acidity is a power of 0.2.

Univariate Plots

What is/are the main feature(s) of interest in your dataset?

The main features in the dataset are alcohol and quality. I’d like to determine which features are best for predicting the quality of red wine. I suspect alcohol, volatile.acidity and residual.sugar can be used to build a model that predicts wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The fixed.acidity, volatile.acidity, citric.acid, residual.sugar and chlorides variables. After researching what affects wine quality, I think residual.sugar contributes the most to quality.

Did you create any new variables from existing variables in the dataset?

This dataset does not lend itself well to creating new variables. I tried some ratios, such as pH/density and alcohol/density, but in the end I only created one variable, quality.bucket.
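A bucketed quality variable like quality.bucket can be built with `cut()`. The report does not state where the bucket boundaries were placed, so the breaks and labels below are assumptions chosen for illustration; the quality values are rebuilt from the counts shown earlier.

```r
# Quality scores rebuilt from the counts in the univariate summary.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Assumed buckets: (2,4] = low, (4,6] = medium, (6,8] = high.
quality.bucket <- cut(quality,
                      breaks = c(2, 4, 6, 8),
                      labels = c("low", "medium", "high"))

bucket_counts <- table(quality.bucket)
print(bucket_counts)
```

With these assumed breaks, most wines land in the middle bucket, which mirrors the concentration of ratings at 5 and 6.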

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I log-transformed the right-skewed chlorides, sulphates, total.sulfur.dioxide and residual.sugar distributions. All of the transformed distributions look roughly normal. I also transformed two variables (volatile.acidity and pH) with a power scale to make them look more normal.

Bivariate Plots Section

First, we compute the correlation of each variable with quality to select the variables we care about most.

##                                                [,1]
## X                                        0.06645261
## fixed.acidity                            0.12405165
## volatile.acidity                        -0.39055778
## citric.acid                              0.22637251
## residual.sugar                           0.01373164
## chlorides                               -0.12890656
## free.sulfur.dioxide                     -0.05065606
## total.sulfur.dioxide                    -0.18510029
## density                                 -0.17491923
## pH                                      -0.05773139
## sulphates                                0.25139708
## alcohol                                  0.47616632
## fixed.acidity.ratio.volatile.acidity     0.34346313
## free.sulfur.dioxide.ratio.total.dioxide  0.19411335

In this subset of the data, fixed.acidity, citric.acid, residual.sugar, chlorides, total.sulfur.dioxide, density, pH and free.sulfur.dioxide.ratio.total.dioxide don’t seem to have strong correlations with quality. However, volatile.acidity, sulphates, alcohol and fixed.acidity.ratio.volatile.acidity are moderately correlated with quality. I want to look more closely at scatter plots of quality against some of these variables, such as alcohol.
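A one-column correlation table like the one above can be produced by passing a matrix of predictors and the quality vector to `cor()`. The sketch below uses simulated data, so the data frame `wines`, its coefficients and its noise level are all assumptions rather than the real dataset.

```r
# Simulated stand-in data: quality depends on alcohol and volatile acidity,
# while residual sugar is generated independently of quality.
set.seed(1)
n <- 300
alcohol          <- rnorm(n, mean = 10.4, sd = 1)
volatile.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
residual.sugar   <- rnorm(n, mean = 2.5, sd = 1)
quality <- round(5.6 + 0.4 * (alcohol - 10.4) -
                 1.5 * (volatile.acidity - 0.53) + rnorm(n, sd = 0.6))
wines <- data.frame(alcohol, volatile.acidity, residual.sugar, quality)

# Correlate every predictor column with quality; the result is a
# one-column matrix, which prints with a "[,1]" header.
predictors <- setdiff(names(wines), "quality")
cors <- cor(as.matrix(wines[, predictors]), wines$quality)

# Sort by absolute correlation to see the strongest candidates first.
cors_sorted <- cors[order(-abs(cors[, 1])), , drop = FALSE]
print(cors_sorted)
```

Sorting by absolute value keeps strong negative correlations, such as the one for volatile acidity, near the top of the list.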

Alcohol is positively correlated with quality, and the median alcohol increases from one quality box to the next. The correlation coefficient between alcohol and quality is 0.476.

Sulphates is the next positively correlated variable, with a correlation coefficient of 0.251. Its median also increases with quality.

Here are some of the negative correlations with quality:

The correlation coefficient between density and quality is -0.175. The fitted line slopes downward and the median also decreases.

Volatile acidity is also negatively correlated with quality. The correlation coefficient between volatile.acidity and quality is -0.391. In the box plot, the median decreases as quality increases.
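The box-plot medians discussed above can also be checked numerically by computing the median of each variable within each quality level. The values below are made up purely to show the mechanics and are not taken from the wine dataset.

```r
# Made-up toy data: alcohol rises and volatile acidity falls with quality.
quality <- c(3, 3, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 8)
alcohol <- c(9.0, 9.4, 9.6, 10.0, 9.5, 9.9, 10.2,
             10.3, 10.6, 11.0, 11.2, 11.6, 12.0, 12.4)
volatile.acidity <- c(1.00, 0.90, 0.80, 0.70, 0.65, 0.60, 0.55,
                      0.50, 0.50, 0.45, 0.40, 0.40, 0.35, 0.40)

# Median of each variable within each quality level.
med_alcohol <- tapply(alcohol, quality, median)
med_va      <- tapply(volatile.acidity, quality, median)
print(rbind(alcohol = med_alcohol, volatile.acidity = med_va))

# diff() > 0 everywhere means the medians increase with quality.
alcohol_increasing <- all(diff(med_alcohol) > 0)
va_decreasing      <- all(diff(med_va) < 0)
```

On real data the medians need not be strictly monotone, so `diff()` is best read as a quick summary alongside the box plot rather than a formal test.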

##   alcohol volatile.acidity sulphates density
## 1     9.4             0.88      0.68  0.9978
## 
## Calls:
## m1: lm(formula = as.numeric(levels(wins$quality))[wins$quality] ~ 
##     alcohol + volatile.acidity + I(log(sulphates)) + density, 
##     data = wins)
## 
## =================================
##   (Intercept)          3.447     
##                      (10.456)    
##   alcohol              0.303***  
##                       (0.019)    
##   volatile.acidity    -1.156***  
##                       (0.097)    
##   I(log(sulphates))    0.641***  
##                       (0.080)    
##   density             -0.077     
##                      (10.377)    
## ---------------------------------
##   R-squared               0.3    
##   adj. R-squared          0.3    
##   sigma                   0.7    
##   F                     210.4    
##   p                       0.0    
##   Log-likelihood      -1587.8    
##   Deviance              682.1    
##   AIC                  3187.5    
##   BIC                  3219.8    
##   N                    1599      
## =================================
##       fit      lwr      upr
## 1 4.95692 3.671375 6.242464

However, the density coefficient is not statistically significant, so we remove it and fit a new model. The results change only slightly.

## 
## Calls:
## m2: lm(formula = as.numeric(levels(wins$quality))[wins$quality] ~ 
##     alcohol + volatile.acidity + I(log(sulphates)), data = wins)
## 
## ================================
##   (Intercept)         3.369***  
##                      (0.184)    
##   alcohol             0.303***  
##                      (0.016)    
##   volatile.acidity   -1.156***  
##                      (0.097)    
##   I(log(sulphates))   0.641***  
##                      (0.077)    
## --------------------------------
##   R-squared               0.3   
##   adj. R-squared          0.3   
##   sigma                   0.7   
##   F                     280.6   
##   p                       0.0   
##   Log-likelihood      -1587.8   
##   Deviance              682.1   
##   AIC                  3185.5   
##   BIC                  3212.4   
##   N                    1599     
## ================================
##        fit      lwr      upr
## 1 4.956922 3.671782 6.242063
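The two-step modelling above (fit with density, then refit without it and predict for one wine) can be sketched as follows. This runs on simulated data, so the data frame `wines`, the generating coefficients and the noise level are assumptions. The report also does not say which interval type was requested; the sketch assumes a prediction interval.

```r
# Simulated stand-in for the wine data; density has no true effect here.
set.seed(42)
n <- 500
wines <- data.frame(
  alcohol          = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58),
  sulphates        = runif(n, 0.33, 2.0),
  density          = runif(n, 0.990, 1.004)
)
wines$quality <- 3.4 + 0.3 * wines$alcohol - 1.2 * wines$volatile.acidity +
  0.6 * log(wines$sulphates) + rnorm(n, sd = 0.65)

# m1 includes density; m2 drops it, as in the report.
m1 <- lm(quality ~ alcohol + volatile.acidity + I(log(sulphates)) + density,
         data = wines)
m2 <- update(m1, . ~ . - density)

# Inspect the p-value of the density term before deciding to drop it.
density_p <- coef(summary(m1))["density", "Pr(>|t|)"]
print(density_p)

# Predict for a wine with the same inputs as the example row in the report.
new_wine <- data.frame(alcohol = 9.4, volatile.acidity = 0.88, sulphates = 0.68)
pred <- predict(m2, newdata = new_wine, interval = "prediction", level = 0.95)
print(pred)
```

Comparing `AIC(m1)` and `AIC(m2)` is one more way to confirm that dropping a term costs little, in line with the nearly identical fit statistics reported above.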

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality correlates most strongly with alcohol and volatile.acidity, and less strongly with sulphates and citric.acid. As alcohol increases, quality increases. As volatile.acidity increases, quality decreases. The relationships between quality and the other variables appear linear.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There are strong positive correlations between citric.acid and fixed.acidity, density and fixed.acidity, and total.sulfur.dioxide and free.sulfur.dioxide. There are strong negative correlations between fixed.acidity and pH, citric.acid and pH, and alcohol and density.

What was the strongest relationship you found?

The strongest relationship is between fixed.acidity and pH, but it does not help predict the variable we care about most, which is quality. Quality is most strongly and positively correlated with alcohol, with a coefficient of 0.476. The second most correlated variable is volatile.acidity, with a coefficient of -0.391. The third and fourth are sulphates (0.251) and citric.acid (0.226).

Multivariate Plots Section

We can see from the plot above that high-quality wines appear most frequently on the low volatile acidity, high alcohol side.

High-quality wines appear most frequently in the upper-right corner, which means high quality goes with high alcohol and high sulphates. We also found that lower-alcohol wines span a wider range of sulphates.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

High-quality wines appear most frequently in the upper-right corner, which means high quality goes with high alcohol and high sulphates. We also found that lower-alcohol wines span a wider range of sulphates. In the plot above, high-quality wines appear most frequently on the low volatile acidity, high alcohol side.
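The pattern described here, where two features jointly separate high-quality wines, can be summarized by splitting each feature at its median and averaging quality in each of the four cells. The eight rows below are invented for illustration and the median split is an assumption, not the method used for the plots.

```r
# Invented toy data, not taken from the wine dataset.
alcohol          <- c(9.2, 9.5, 9.8, 10.0, 11.5, 12.0, 12.5, 11.8)
volatile.acidity <- c(0.80, 0.35, 0.70, 0.40, 0.75, 0.30, 0.35, 0.65)
quality          <- c(4, 5, 5, 6, 5, 7, 8, 6)

# Split each feature at its median into a "high" and a "low" group.
alc_grp <- ifelse(alcohol > median(alcohol), "high alcohol", "low alcohol")
va_grp  <- ifelse(volatile.acidity > median(volatile.acidity), "high VA", "low VA")

# Mean quality in each alcohol-by-volatile-acidity cell.
cell_means <- tapply(quality, list(alc_grp, va_grp), mean)
print(cell_means)
```

In this toy example the highest mean quality sits in the high alcohol, low volatile acidity cell, the same corner the plot highlights.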

Were there any interesting or surprising interactions between features?

Residual sugar shows almost no relationship with wine quality (its correlation with quality is only 0.014). I had expected wines with less sugar to be rated higher.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

We created a linear model to predict quality from alcohol, volatile.acidity, sulphates and density. The newWine observation is an example: the model predicts a fit of 4.95692, with a 95% interval from 3.671375 to 6.242464. A strength of the model is that it captures the large effect of alcohol on quality. A limitation is that it does not include variables such as brand or location, which may also affect wine quality.
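The predicted value for the example wine can be checked by hand from the coefficients printed for model m2 (the model without density) and the example row shown before the first model. Only numbers from the output above are used; because the printed coefficients are rounded to three decimals, the hand calculation is expected to be close to the reported fit rather than identical.

```r
# Coefficients as printed for m2 (rounded to three decimals).
coefs <- c(intercept = 3.369, alcohol = 0.303,
           volatile.acidity = -1.156, log_sulphates = 0.641)

# The example wine from the report.
new_wine <- c(alcohol = 9.4, volatile.acidity = 0.88, sulphates = 0.68)

# Linear predictor; sulphates enters the model through its natural log.
fit_by_hand <- coefs[["intercept"]] +
  coefs[["alcohol"]]          * new_wine[["alcohol"]] +
  coefs[["volatile.acidity"]] * new_wine[["volatile.acidity"]] +
  coefs[["log_sulphates"]]    * log(new_wine[["sulphates"]])

reported_fit <- 4.956922
print(c(by_hand = fit_by_hand, reported = reported_fit))
```

The hand-computed value comes out at about 4.95, within rounding error of the reported fit.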

Final Plots and Summary

Plot One

Description One

Alcohol has a large impact on wine quality. In the linear model, each one-unit increase in alcohol is associated with a 0.303 increase in quality. The three lines, from top to bottom, are the 0.9 quantile, the median and the 0.1 quantile. At quality 5, the density of points is very high.
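The three quantile lines in this plot correspond to the 0.1, 0.5 and 0.9 quantiles of alcohol within each quality level. A minimal sketch of that computation is shown below on invented data; the data frame `wines` here is hypothetical and much smaller than the real dataset.

```r
# Invented toy data: five wines at each of three quality levels.
wines <- data.frame(
  quality = rep(c(5, 6, 7), each = 5),
  alcohol = c(9.0, 9.4, 9.5, 9.8, 10.5,
              9.6, 10.2, 10.5, 11.0, 11.8,
              10.4, 11.0, 11.5, 12.2, 12.9)
)

# One column per quality level, one row per quantile (10%, 50%, 90%).
q <- sapply(split(wines$alcohol, wines$quality),
            quantile, probs = c(0.1, 0.5, 0.9))
print(q)
```

Reading along a row of the result traces one of the three lines across the quality levels.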

Plot Two

##   3   4   5   6   7   8 
##  10  53 681 638 199  18
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Description Two

Quality is roughly normally distributed. The counts for quality 5 and 6 are very high, which means most wines belong to quality 5 or 6: 681 wines have quality 5 and 638 have quality 6. The first quartile is 5 and the third quartile is 6. The mean quality is 5.636 and the median is 6.

Plot Three

Description Three

As volatile acidity increases, quality decreases, especially between quality 4 and 5. In the linear model, a one-unit increase in volatile acidity is associated with a 1.156 decrease in quality. For quality 5 and 6, most volatile acidity values lie between 0.4 and 0.8 according to the three quantile lines. Above quality 7, the volatile acidity trend flattens out.

Reflection

The red wine dataset contains information on almost 1,600 wines across 12 variables. I started by understanding the top 10 individual variables in the dataset, and then I explored interesting questions and leads as I continued to make observations on the plots. Eventually, I explored the quality of the wine across many variables and created a linear model to predict wine quality.

There was a clear trend between quality and alcohol, sulphates and volatile.acidity. I was surprised that residual.sugar did not have a strong positive correlation with quality. For the linear model, all wines were included, since quality, alcohol, volatile.acidity and sulphates are available for every wine. After transforming sulphates to a log scale, the model was able to account for about 30% of the variance in quality.
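The share of variance explained can be cross-checked from numbers already printed in this report: the quality counts give the total sum of squares, and the "Deviance" row of the model table is the residual sum of squares of a linear model. The sketch below assumes the three-predictor model m2 and uses only those reported figures.

```r
# Quality scores rebuilt from the reported counts.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

tss <- sum((quality - mean(quality))^2)  # total sum of squares
rss <- 682.1                             # "Deviance" reported in the model table
n   <- length(quality)                   # 1599 wines
p   <- 3                                 # predictors in m2

r_squared <- 1 - rss / tss
f_stat    <- (r_squared / p) / ((1 - r_squared) / (n - p - 1))
sigma_hat <- sqrt(rss / (n - p - 1))

print(round(c(r_squared = r_squared, F = f_stat, sigma = sigma_hat), 3))
```

The R-squared comes out near 0.35, which is consistent with the 0.3 shown in the output when rounded to one decimal, and the implied F statistic lands close to the 280.6 reported for m2.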

One challenge in my analysis was that all the variables are continuous, so I could not separate the wines into clear groups. I therefore cut quality into buckets. However, this could be misleading, because a reader may assume quality is a continuous variable, so I also converted quality to a numeric variable. The second challenge was choosing the right plot and the right variables for the multivariate plots, because this dataset is not well suited to them. Eventually, I found that the dataset works well with quality as the color and two other variables on the axes.

The dataset has some limitations. It does not include variables such as production date, location or brand. In the future, we could include more features, such as brand, production location and production date, to improve the prediction results.